Abstract

We consider processes on social networks that can potentially involve 
three factors: homophily, or the formation of social ties due to matching 
individual traits; social contagion, also known as social influence; and the 
causal effect of an individual's covariates on their behavior or other mea- 
surable responses. We show that, generically, all of these are confounded 
with each other. Distinguishing them from one another requires strong as- 
sumptions on the parametrization of the social process or on the adequacy 
of the covariates used (or both). In particular we demonstrate, with sim- 
ple examples, that asymmetries in regression coefficients cannot identify 
causal effects, and that very simple models of imitation (a form of social 
contagion) can produce substantial correlations between an individual's 
enduring traits and their choices, even when there is no intrinsic affinity 
between them. We also suggest some possible constructive responses to 
these results. 
1 Introduction: "If your friend jumped off a bridge, would you jump too?"

We all know that people who are close to each other in a social network are
similar in many ways: they share characteristics, act in similar ways, and similar
events are known to befall them. Do they act similarly because they are close in
the network, due to some form of influence that acts along network ties (or, as it
is often suggestively put, "contagion"^)? Or rather are they close in the network
because of these similarities, through the processes known as assortative mixing on
traits, or more simply as homophily (McPherson et al., 2001)? Suppose that
there are two friends named Ian and Joey, and Ian's parents ask him the classic
hypothetical of social influence: "If your friend Joey jumped off a bridge, would
you jump too?" Why might Ian answer "yes"?

1. Because Joey's example inspired Ian (social contagion/influence); 

2. Because Joey infected Ian with a parasite which suppresses fear of falling
(biological contagion);

3. Because Joey and Ian are friends on account of their shared fondness for 
jumping off bridges (manifest homophily, on the characteristic of interest); 

4. Because Joey and Ian became friends through a thrill-seeking club, whose 
membership rolls are publicly available (secondary homophily, on a differ- 
ent yet observed characteristic); 

5. Because Joey and Ian became friends through their shared fondness for
roller-coasters, which was caused by their common thrill-seeking propensity,
which also leads them to jump off bridges (latent homophily, on an
unobserved characteristic);

^Analogies between the spread of ideas and behaviors (especially disliked ideas and
behaviors) and the spread of disease are ancient. Pliny the Younger, for instance, referred
to Christianity as a "contagious superstition" in a letter to the Emperor Trajan in 110 (Epistles
X 96). Siegfried (1960/1965) gives further examples. The best treatment of this analogy is
made by Sperber (1996).




6. Because Joey and Ian both happen to be on the Tacoma Narrows Bridge 
in November, 1940, and jumping is safer than staying on a bridge that is 
tearing itself apart (common external causation). 

The distinctions between these mechanisms — and others which no doubt 
occur to the reader — are all ones which make causal differences. In particular, 
if there is any sort of contagion, then measures which specifically prevent Joey 
from jumping off the bridge (such as restraining him) will also have the effect 
of tending to keep Ian from doing so; this is not the case if contagion is absent. 
However, the crucial question is whether these distinctions make differences in 
the purely observational setting, since we are usually not able to conduct an 
experiment in which we push Joey off the bridge and see whether Ian jumps (let 
alone repeated trials).

The goal of this paper is to establish that these are, by and large, phenom- 
ena that are surprisingly difficult to distinguish in purely observational studies. 
More precisely, latent homophily and contagion are generically confounded with
each other (Section 2), and any direct contagion effects cannot be nonparametrically
identified from observational data.^ To identify contagion effects, we need
either strong parametric assumptions or strong substantive knowledge that lets 
us rule out latent homophily as a causal factor. It has been proposed that 
asymmetries in regression estimates which match asymmetries in the social net- 
work would let us establish direct social contagion; we show (Section 2.2) as a 
corollary of our main result that this also fails. 

We realize that many issues with unobservable characteristics exist in many 
observational study settings, not just in those that share our explicit focus on 
network phenomena, yet our investigations of social contagion are not driven by 
some animus; we are just as concerned for those investigations that ignore net- 
work structure when it is present. If contagion works along with homophily, we 
show that it confounds inferences for relationships between homophilous traits 
and outcome variables such as observed behaviors (Section 3). In particular, 
even when the true causal effect of the homophilous trait is zero, the trait can 
still act as a strong predictor of the outcome of interest merely through the 
outcome's natural diffusion in a network (Section 3.1). 

We also realize that our main findings are negative, and implicitly critical of 
much previous work. Section 4 suggests some possible constructive responses to 
our findings, while Section 5 concludes with some methodological reflections. 

^We remind the reader of the relevant sense of "identification" (Manski, 2007). We have
a collection of random variables, which are generated by one causal process $M$ out of a set
of possible processes $\mathcal{M}$. Not all aspects of this process are recorded, and the result is a
distribution $P$ over observables. Each $M$ leads to only one distribution over observables,
$P(M)$. A functional $\theta$ of the data-generating process is identifiable if it depends on $M$ only
through $P(M)$, i.e., if $\theta(M) \neq \theta(M')$ implies $P(M) \neq P(M')$. Otherwise, the functional
is unidentifiable. If $\theta$ is identifiable only when $M$ is restricted to a finitely parameterized
family, then $\theta$ is parametrically identifiable (within that family). If $\theta$ is identifiable without
such a restriction, it is non-parametrically identifiable. See further Pearl (2009b, ch. 3) on
identification of causal effects from observables.
1.1 Notation, Terminology, Conventions



In our framework the random variable $X_i$ is a collection of unchanging latent
traits for node $i$; similarly, $Z_i$ is a collection of static observed traits. Both $X$
and $Z$ may be discrete, continuous, mixtures of both, etc. The social network
is represented by the binary variable $A_{ij}$, which is 1 if there is a (directed) edge
from $i$ to $j$ (that is, $i$ considers $j$ to be a "friend") and 0 otherwise. Time
$t$ advances in discrete steps of equal duration; this is inessential but avoids
mathematical complications. $Y_i(t)$ denotes a response variable for node $i$ at
time $t$; again, whether categorical, metric or otherwise doesn't matter. (We will
sometimes write this as $Y(i,t)$ or even $Y_{it}$, as typographically convenient, and
likewise for other indices.) These variables are also listed in Figure 2, alongside
a graphical representation of the prototypical process we are examining.

We conducted all simulations in R (R Development Core Team, 2008). Our
code is available from http://www.stat.cmu.edu/~cshalizi/homophily-confounding/.

2 How Homophily and Individual-Level Causation Look Like Contagion

The members of a social network often exhibit correlated behavior. When we 
speak of contagion or influence within networks, we imply that conditioning on 
all other factors, there will be a temporal relationship between the behaviour of 
individual i at time t and any neighbours of i (potential j's) at the previous time 
point. This is easiest to see when all causes of adoption of a trait other than
the network itself are eliminated, as with person-to-person infectious diseases
(Bartlett, 1960; Ellner and Guckenheimer, 2006; Newman, 2002), though other
examples include the spread of innovations (Rogers, 2003).

More puzzling are situations such as the investigation of Christakis and
Fowler (2007), where the behavior that apparently spreads through the network
is "becoming obese", as obesity is not normally thought of as an infectious
condition,^ or the apparent spread of "happiness", documented by Fowler and
Christakis (2008). It is natural to ask how much of such "network autocorrelation"
(the tendency of these behaviors to be correlated in individuals that
are closely connected) is due to some direct influence of $i$'s neighbors on $i$'s
behavior, as opposed to the effect of homophily, in which social ties form between
individuals with similar antecedent characteristics, who may then behave
similarly as a result.^

Social network scholars have long been concerned with this issue, under the 
label of "selection versus influence" or "homophily versus contagion" (Leenders, 

^There are, however, claims in the medical literature (Atkinson, 2007) that certain viruses
induce obesity in rodents and may contribute to the condition in human beings. (Thanks to 
Matthew Berryman and Gustavo Lacerda for bringing this to our attention.) We lack the 
knowledge to assess the soundness of these claims, let alone their plausibility as explanations 
of human obesity. 

^Sperber (1996, ch. 5) is a detailed and subtle exploration of just how powerful the latter 
mechanism can be, and how it can interact with imitation or contagion. 
1995), usually with regard to manifest homophily but certainly not limited to
it. To give just one example of a sophisticated recent attempt to divide the
credit for network autocorrelation between homophily and contagion, consider
Aral et al. (2009). (The following remarks apply, with suitable changes, to
many other high-quality studies, e.g., Bakshy et al. (2009); Anagnostopoulos
et al. (2008); Yang et al. (2007); Bramoullé et al. (2009).) They worked with
a uniquely obtained data set with a clear outcome measure: the adoption of
an online service over time, with users of an instant messaging service as the
(extremely large) community of interest. To separate the effects of contagion
from those of homophily, they assembled a large and rich table of covariates on
each individual's personal and network characteristics (46 covariates in total),
and formed matched pairs using propensity score estimation (Rosenbaum and
Rubin, 1983), so that one member of each pair had, at one point, exposure to
the online service through one (or more) of their network neighbors. Assuming
that the matching teases out the differences in these characteristics, the
difference in adoption rates within pairs reflects the adoption due to contagion,
allowing for an estimate of the proportion of association that is attributable to
contagion, as opposed to the proportions caused by homophily, either secondary
(in terms of the 46 observed characteristics) or manifest (caused by two users
becoming friends specifically due to their connection on the online service).
Notably, this does not account for latent homophily, which may still remain as
a component of the so-called "contagious" proportion. Propensity score matching
can simplify the relationships between observed properties and the adoption
of a "treatment" (in this case, network-localized exposure to the service), but
the effort may prove to be inadequate if any unobserved covariates have a part
in both tie selection and service adoption.

This brings us to our fundamental point: to attempt to assign strengths to 
influence or contagion as opposed to homophily presupposes that the distinc- 
tion is identifiable, and there have been grounds to doubt this for some time. 
Manski (1993), in a well-known paper, considered the related problem of the 
identification of group effects: supposing that an individual's behavior depends 
on some individual-level predictors and on the mean behavior of the group to 
which they belong, can the degree of dependence on the group be identified? He
showed that in general the answer is "no", unless you make strong parametric
assumptions, and perhaps not even then (since group effects can fail to be
identified even in linear models). Indeed, this has been shown to cause difficulties
in other social situations where this sort of phantom influence can be observed:
among others, Calvo-Armengol and Jackson (2009) note that estimating the 
apparent effect of parental influence on their child's educational outcomes is 
confounded by the actions of the larger community. (See Blume et al. 2010 for 
a recent review of the group-effects literature.) However, this does not quite 
answer our questions, since Manski considered influence from the group aver- 
age, rather than from individual members of the network neighborhood, and 
one could hope this would provide enough extra information for identification. 

We now show that, in fact, contagion effects are nonparametrically
unidentifiable in the presence of latent homophily; that is, there is just no way to
separate selection from influence observationally. Our proof involves some simple
manipulations of graphical causal models; we refer the reader to standard
references (Spirtes et al., 2001; Pearl, 2009a,b; Morgan and Winship, 2007) for
the necessary background.

2.1 Contagion Effects are Nonparametrically Unidentifiable

We first assume that there is latent homophily present in the system: the network
tie $A_{ij}$ is influenced by the unobserved traits of each individual, $X_i$ and
$X_j$. We assume that the "past" observable outcome $Y_i(t-1)$ has a direct influence
on the same outcome measured in the present, $Y_i(t)$.^ We also assume that
$X_i$ directly influences $Y_i(t)$ for all $t$, though possibly not with the same magnitude
or mechanism at each time $t$.^ Finally, we assume that another individual's
prior outcome $Y_j(t-1)$ can directly influence $Y_i(t)$ only if $A_{ij} = 1$; that is,
there must be an edge present for this direct influence to occur. We are indifferent
as to whether the observable covariates $Z_i$ have a direct influence on
$Y_i(\cdot)$, or whether they are correlated with the latent covariates $X_i$. The upshot of
these assumptions is the causal graph in Figure 1, examination of which should
make it unsurprising that contagion, the direct influence of $Y_j(t-1)$ on $Y_i(t)$,
is confounded with latent homophily:

• $Y_j(t-1)$ is informative about $X_j$;

• $X_j$ is informative about $X_i$ when $i$ and $j$ are linked ($A_{ij} = 1$); and

• $X_i$ is informative about $Y_i(t)$.

Thus $Y_i(t)$ depends statistically on $Y_j(t-1)$, whether or not there is a direct
causal effect of contagion present.
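The three-step chain above can be checked numerically. The following is a minimal sketch, not the authors' code (their simulations were in R): it assumes a uniform latent trait, outcomes equal to the trait plus independent noise, and a logistic tie probability with an arbitrary steepness `slope`. All function names and parameter values are illustrative assumptions. No outcome ever depends on a neighbor's outcome, yet among linked pairs $Y_i(t)$ is clearly correlated with $Y_j(t-1)$.

```python
import math
import random


def inv_logit(v):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-v))


def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def linked_vs_all_pairs(n=300, slope=8.0, noise_sd=0.1, seed=1):
    """Correlation of Y_i(t) with Y_j(t-1) over linked pairs and over all pairs.

    Assumed toy model (hypothetical parameters): X_i ~ U(0,1); each outcome is
    the node's own X_i plus independent Gaussian noise, so there is no
    contagion; a directed tie i -> j forms with probability
    inv_logit(-slope * |X_i - X_j|), which is latent homophily.
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y_prev = [xi + rng.gauss(0.0, noise_sd) for xi in x]  # Y_i(t-1)
    y_now = [xi + rng.gauss(0.0, noise_sd) for xi in x]   # Y_i(t)
    linked_now, linked_prev, all_now, all_prev = [], [], [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            all_now.append(y_now[i])
            all_prev.append(y_prev[j])
            if rng.random() < inv_logit(-slope * abs(x[i] - x[j])):
                linked_now.append(y_now[i])
                linked_prev.append(y_prev[j])
    return pearson(linked_now, linked_prev), pearson(all_now, all_prev)
```

Calling `linked_vs_all_pairs()` returns the two correlations; conditioning on the tie is what creates the dependence, since over all ordered pairs the correlation is essentially zero.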

While this argument would appear to be loosely assembled, it can be tightened
up using the familiar rules for manipulating graphical causal models (Spirtes
et al., 2001; Pearl, 2009b). $X_i$ d-separates $Y_i(t)$ from $A_{ij}$. Since $X_i$ is latent
and unobserved, $Y_i(t) \leftarrow X_i \rightarrow A_{ij}$ is a confounding path from $Y_i(t)$ to $A_{ij}$.
Likewise $Y_j(t-1) \leftarrow X_j \rightarrow A_{ij}$ is a confounding path from $Y_j(t-1)$ to $A_{ij}$.
Thus, $Y_i(t)$ and $Y_j(t-1)$ are d-connected when conditioning on all the observed
(boxed) variables in Figure 1. Hence the direct effect of $Y_j(t-1)$ on $Y_i(t)$ is not
identifiable (Pearl, 2009b, §3.5, pp. 93-94).

This argument is not affected by adding conditioning on $Y_i(t-1)$ or $Y_j(t)$,
as that does not remove the confounding paths. Nor does adding conditioning
on $Z_i$, $Z_j$ remove the confounding. Nor is the situation helped by allowing $A_{ij}$,

^The results of this investigation hold even if this assumption is dropped, or if the time
dependence goes beyond the first order; that is, $Y_i(t-k)$ continues to influence $Y_i(t)$ even
after controlling for $Y_i(t-1)$.

^The result will go through so long as $Y_i(t_0)$ is influenced by $X_i$ for at least one $t_0$, and
for the subsequent observation $t > t_0$.
Figure 1: Causal graph allowing for
latent variables ($X$) to influence both
manifest network ties $A_{ij}$ and manifest
behaviors ($Y$).



Figure 2: Notational guide to 
terms used in this investigation. 



or indeed $X$, to vary over time, as is readily verified by drawing the appropriate
graphs. Finally, adding a third individual to the graph would not help: even if
they were, say, assumed to be linked to $i$ but not $j$ or vice versa, $Y_i(t) \leftarrow X_i \rightarrow A_{ij}$
and $Y_j(t-1) \leftarrow X_j \rightarrow A_{ij}$ would remain confounding paths.

How then might we get identifiability? It may be that very stringent parametric
assumptions would suffice, though we have not been able to come up
with any which would.^ Otherwise, we must keep $X$ from being latent,
or, more precisely, either the components of $X$ that influence $Y$ must be
made observable (Figure 3a), or those parts of $X$ which influence the social tie
formation $A$ (Figure 3b). In either case the confounding arcs go away, and the
direct effect of $Y_j(t-1)$ on $Y_i(t)$ becomes identifiable.^ It is noteworthy that
the most successful attempts at explicit modeling that handle both homophily
and influence, as found in the work of Leenders (1995) and Steglich et al. (2004),
involve, all at once, strong parametric (exponential-family) assumptions, plus
the assumption that observable covariates carry all of the dependence from $X$
to $Y$ and $A$; the latter is also implicitly assumed by the matching methods of
Aral et al. (2009).

Whether we face the unidentifiable situation of Figure 1, or the identifiable 
case of Figure 3, currently depends upon subject-matter knowledge rather than 
statistical techniques. It may be possible to adapt algorithms, such as those in 

^In particular, making all of the relations between continuous variables in Figure 1 linear, 
with independent noise for each variable, is not enough — the confounding path continues to 
prevent identifiability even in a linear model. 

^Elwert and Christakis (2008) is another interesting approach. In effect, they introduce
a third node, call it $k$, where they can assume that $Y_i$ is not influenced by $Y_k$, but the
homophily is the same. Estimating the apparent influence of $Y_k$ on $Y_i$ then shows the extent
of confounding due purely to homophily; if $Y_i$ is more dependent than this on $Y_j$, the excess
is presumably due to actual causal influence.
Figure 3: Modifications of the causal graph shown in Figure 1, in which observable
covariates ($Z$) convey enough information about $X$ that contagion effects
are unconfounded with latent homophily. In a (left), $Z$ carries all of the causal
effect from $X$ to the observable outcome $Y$; in b (right), $Z$ carries all of the
effect from $X$ to the social network tie $A$.

Spirtes et al. (2001), to detect the presence of influential latent variables. Some 
new methodological work would be required, however, since all such algorithms 
known to us rely strongly on having a supply of independent cases, and social 
networks are of interest precisely because individuals, and even dyads, are not 
independent. 

2.2 The Argument from Asymmetry 

A clever argument for the presence of direct influence was introduced by Christakis
and Fowler (2007). By focusing on unreciprocated directed edges, that is, pairs
$(i, j)$ where $A_{ij} = 1$ but $A_{ji} = 0$, so that $j$'s prior outcome can be said to influence
$i$'s present outcome, but $i$'s prior outcome cannot be said to influence $j$'s, one can
consider the distributions of the outcomes conditional on their partner's previous outcome,
$Y_i(t) \mid Y_j(t-1)$ and $Y_j(t) \mid Y_i(t-1)$ (though other observable covariates $(Z_i, Z_j)$
may also be conditioned on). An asymmetry here, revealed by the difference in
the corresponding regression coefficients, might then be due to some influence
being transmitted along the asymmetric edge, and not due to external common
causes (such as a new fast food restaurant) or other behaviours attributable to
latent characteristics.

This idea has considerable plausibility and has been picked up by a number of
other authors (Anagnostopoulos et al., 2008; Bramoullé et al., 2009), who have
shown that it works as a test for direct influence in some models. However, we
show that the argument can break down if two conditions are met: first, the
influencers (the $j$ in the pair) differ systematically in their values of $X$ from
the influenced (the $i$), and, second, different neighborhoods of $X$ have different
local (linear) relationships to $Y$. As previously mentioned, the most successful
claims of simultaneous accounting of these phenomena require strong parametric
assumptions, and our demonstration shows that even assumptions of linearity
may be too strong for this sort of data.

To illustrate this claim, we present a toy model of a network with latent 
homophily on an X variable that controls an observable time series Y at multiple 
points, but with no direct influence between values of Y for different nodes. We 
present this as a multi-step time series to approximate the scenario of Christakis 
and Fowler (2007), so that we can add the two most recent time steps of the 
alter's expression into the regression.^ We also note that there is no "coupled 
evolution" of two nodes' outcomes due to an exogenous common cause, one of 
the stated purposes of the asymmetry test. Despite the lack of direct interaction,
it is possible to predict $Y_i$ at time $t$ from the value of $Y$ at its neighbors for
times $t-1$ and $t-2$, and these relations are asymmetric across unreciprocated
edges.

First we present the formation of the network, which contains $n$ individuals
(nodes), and each node $i$ has a scalar latent attribute $X_i \sim U(0,1)$, which
are generated independently. We generate an underlying undirected network
(a potential friendship pool) where such an edge forms between $i$ and $j$ with
probability equal to $\mathrm{logit}^{-1}(-3|X_i - X_j|)$, so that edges are more likely to
form between individuals with similar values of $X$. Each individual $i$ then
nominates their "declared" friendships from these neighbors, naming $j$ with
probability proportional to $\mathrm{logit}^{-1}(-|X_j - 0.5|)$; that is, individuals, whatever
their own value of $X$, prefer to nominate acquaintances closer to the median value
of that trait.^ For this demonstration, as in the data sets used in Christakis
and Fowler (2007) and Fowler and Christakis (2008), each individual $i$ declares
one friend, though the results hold for greater numbers of nominations. This
produces the sociomatrix/adjacency matrix $A$, where $A_{ij} = 1$ signifies that
individual $i$ has nominated $j$ as a "friend".
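The network-formation step can be sketched as follows. This is an illustrative reading of the description above in Python's standard library, not the authors' R code; the function names, and the choice to leave a node with an empty friendship pool without a nomination, are assumptions made for the sketch.

```python
import math
import random


def inv_logit(v):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-v))


def make_network(n, rng):
    """Generate latent traits, the undirected friendship pool, and nominations.

    Returns (x, pool, nominee) where x[i] ~ U(0,1), pool[i] is the set of
    potential friends of i, and nominee[i] is the single friend that i
    declares (None if the pool is empty, which is an assumption of this
    sketch). The adjacency matrix has A_ij = 1 exactly when nominee[i] == j.
    """
    x = [rng.random() for _ in range(n)]
    pool = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Undirected edge with probability logit^{-1}(-3 |X_i - X_j|).
            if rng.random() < inv_logit(-3.0 * abs(x[i] - x[j])):
                pool[i].add(j)
                pool[j].add(i)
    nominee = [None] * n
    for i in range(n):
        candidates = sorted(pool[i])
        if candidates:
            # Nomination weight proportional to logit^{-1}(-|X_j - 0.5|).
            weights = [inv_logit(-abs(x[j] - 0.5)) for j in candidates]
            nominee[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return x, pool, nominee
```

For example, `make_network(400, random.Random(0))` generates one network of the size used in the text; storing one nominee per node rather than a full matrix is only a convenience.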

Second, we establish the time trends of the observable outcomes
$(Y_i(t=0), Y_i(t=1))$:

• At time $t = 0$, we set $Y_i(0) = (X_i - 0.5)^{?} + N(0, (0.02)^2)$, a nonlinear
assignment of outcome attributes.

• For time $t = 1$, we set $Y_i(1) = Y_i(0) + 0.4 X_i + N(0, (0.02)^2)$, so that
the trend is greater for those individuals with higher values of the latent
attribute.

^The method in Christakis and Fowler (2007) uses a "simultaneous" regression set-up, 
including Yj{t) as a predictor of Yi(t) as well as a previous time point Yj{t — 1). Treated at 
face value, this can produce an incoherent probability distribution for the evolution of the 
system (Lyons, 2010), as well as implying a scarcely-comprehensible notion of simultaneous 
causation (rather than coupled behavior or feedback); this can be somewhat salvaged by 
considering it as an observation that shares information from the "minus one-half" time
point, as well as picking up any coupled behaviour at time t.

^Whether this is an actual bias in the social network formation process, or merely a part
of the process recording the network, does not matter. Also, results would work equally well 
if ties were biased towards extreme rather than central values of X, for multivariate latent 
traits, and so forth. 
Figure 4: Graphical causal model for our simulation study in Section 2.2. Here,
unlike Figure 1, there are no arrows from $(Y_j(t-2), Y_j(t-1))$ to $Y_i(t)$; the
former outcomes for the "alter" are not, in reality, a cause of the latter for the
"ego", and the relationships of the $Y_j$ and $Y_i$ time series are symmetrical. As
we show in the text, however, not only is $Y_i(t)$ predictable from $Y_j(t-1)$, but
the relationship is asymmetric when social network ties are unreciprocated, i.e.,
$A_{ij} = 1$ but $A_{ji} = 0$.

• For time $t = 2$, we set $Y_i(2) = Y_i(1) + 0.4 X_i + N(0, (0.02)^2)$, repeating the
trend.

Figure 4 is the graphical model for the actual causal structure of our simu- 
lation at three time points. 
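The three outcome waves can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the exponent of the nonlinear term at time 0 is not legible in this copy of the text, so the hypothetical function below takes it as a required argument `power` instead of fixing a value; `trend` and `sd` default to the 0.4 and 0.02 given in the text.

```python
import random


def simulate_outcomes(x, rng, power, trend=0.4, sd=0.02):
    """Generate Y(0), Y(1), Y(2) from latent traits x, with no contagion.

    power is the exponent applied to (X_i - 0.5) at time 0; it must be
    supplied by the caller because the source text does not show it legibly.
    Each later wave adds trend * X_i plus fresh Gaussian noise, so every
    node's series depends only on its own latent trait.
    """
    y0 = [(xi - 0.5) ** power + rng.gauss(0.0, sd) for xi in x]
    y1 = [a + trend * xi + rng.gauss(0.0, sd) for a, xi in zip(y0, x)]
    y2 = [b + trend * xi + rng.gauss(0.0, sd) for b, xi in zip(y1, x)]
    return y0, y1, y2
```

A call such as `simulate_outcomes(x, random.Random(0), power=2)` uses a square purely for illustration. Because no term involves another node's outcome, any network effect found in later regressions is spurious by construction.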

We simulated a network of fixed size ($n = 400$) from this model and estimated
the linear model

$$Y_i(2) = \alpha + \beta_1 Y_i(1) + \beta_2 \sum_j A_{ij} Y_j(1) + \beta_3 \sum_j A_{ji} Y_j(1) + \beta_4 \sum_j A_{ij} Y_j(0) + \beta_5 \sum_j A_{ji} Y_j(0) + \epsilon_i,$$

so that $\alpha$ represents the intercept term and $\beta_1$ represents the autocorrelation;
$\beta_2$ is the effect of the nominee's status at time $t = 1$ on the nominator, and $\beta_3$ is
the converse, the network effect if $i$ was nominated by $j$, at time $t = 1$; $\beta_4$ and $\beta_5$
are those same coefficients for the outcome at time $t = 0$. This was replicated
5000 times, with the latent variables, time series and network regenerated in
each replication.
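One replication of this regression can be sketched end to end using only Python's standard library. This is a hedged illustration, not the authors' R code: all function names are hypothetical, the exponent `power` of the time-0 term is left as a required argument because it is not legible in this copy of the text, nodes with an empty friendship pool are assumed to nominate nobody, and ordinary least squares is solved through the normal equations for brevity rather than by a numerically preferable QR decomposition.

```python
import math
import random


def inv_logit(v):
    return 1.0 / (1.0 + math.exp(-v))


def simulate(n, rng, power):
    """One replication: latent X, one nomination per node, three outcome waves."""
    x = [rng.random() for _ in range(n)]
    pool = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < inv_logit(-3.0 * abs(x[i] - x[j])):
                pool[i].append(j)
                pool[j].append(i)
    nominee = [None] * n  # nominee[i] == j means A_ij = 1
    for i in range(n):
        if pool[i]:
            w = [inv_logit(-abs(x[j] - 0.5)) for j in pool[i]]
            nominee[i] = rng.choices(pool[i], weights=w, k=1)[0]
    y0 = [(xi - 0.5) ** power + rng.gauss(0.0, 0.02) for xi in x]
    y1 = [a + 0.4 * xi + rng.gauss(0.0, 0.02) for a, xi in zip(y0, x)]
    y2 = [b + 0.4 * xi + rng.gauss(0.0, 0.02) for b, xi in zip(y1, x)]
    return nominee, y0, y1, y2


def design_matrix(nominee, y0, y1):
    """Rows [1, Y_i(1), sum_j A_ij Y_j(1), sum_j A_ji Y_j(1),
             sum_j A_ij Y_j(0), sum_j A_ji Y_j(0)], matching alpha, beta_1..beta_5."""
    n = len(y1)
    named_by = [[] for _ in range(n)]  # named_by[i] lists j with A_ji = 1
    for j, i in enumerate(nominee):
        if i is not None:
            named_by[i].append(j)
    rows = []
    for i in range(n):
        k = nominee[i]
        out1 = y1[k] if k is not None else 0.0
        out0 = y0[k] if k is not None else 0.0
        in1 = sum(y1[j] for j in named_by[i])
        in0 = sum(y0[j] for j in named_by[i])
        rows.append([1.0, y1[i], out1, in1, out0, in0])
    return rows


def ols(rows, y):
    """Least squares via the normal equations, solved by Gauss-Jordan
    elimination with partial pivoting. Assumes a full-rank design."""
    p = len(rows[0])
    a = [[sum(r[k] * r[m] for r in rows) for m in range(p)]
         + [sum(r[k] * yi for r, yi in zip(rows, y))] for k in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[k][p] / a[k][k] for k in range(p)]
```

Fitting one replication is then `nominee, y0, y1, y2 = simulate(400, random.Random(0), power=2)` followed by `ols(design_matrix(nominee, y0, y1), y2)`, with `power=2` chosen only for illustration. Repeating this with fresh seeds and collecting the third and fourth coefficients gives the kind of sampling distributions the text goes on to summarize.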

Figure 5 shows the results of these simulations. Figure 5a shows the magnitude
of $\beta_2$, the coefficient of network influence; in 4010 of these 5000 trials is
[Figure 5 panels (histograms): (a) Effect of Phantom 'Influencer' on 'Influenced' in Time Series; x-axis: Regression Coefficient, Mean = 0.0257. (b) Effect of Phantom 'Influenced' on 'Influencer' in Time Series; x-axis: Regression Coefficient, Mean = 0.00192. (c) Mutual Effect of Phantom Influence; x-axis: Regression Coefficient Sum, Mean = 0.0276. (d) Directional Difference of Phantom Influence; x-axis: Coefficient Difference, Mean = 0.0237, Prop>0: 0.775.]


Figure 5: Results for a toy model where a latent variable causes spurious
time-dependent network effects. Clockwise from the top left: a) The estimate for $\beta_2$,
the effect in the expected direction of influence. b) The estimate for $\beta_3$, the effect
in the opposite direction of influence (from the namer to the named). c) The
sum of the estimated effects, indicating that the effect for a mutual tie (in which
each respondent names the other) is greater than either the expected or opposite
unreciprocated tie effect. d) The normalized difference between directional
effects is clearly greater than zero on balance (in roughly 77% of simulations),
suggesting that the asymmetry in coefficient estimates can be produced without
contagion and falsely detected by t-tests on the difference.
the estimate less than zero despite the lack of a direct connection, in line with
the empirical results of Cohen-Cole and Fletcher (2008). This is also the case
for Figure 5b, showing the apparent coefficient of a "reverse" network effect $\beta_3$,
which is smaller in magnitude. Figure 5c shows the sum of the two effects; this
demonstrates that the effect of a mutual tie, where $A_{ij}A_{ji} = 1$, is determined
by the sum of the one-way effects and is greater than the effect of a "named"
tie, $A_{ij} = 1$, which is greater than the effect of a "naming" tie, $A_{ji} = 1$. This
is the result of the type that was cited in Christakis and Fowler (2007) and Fowler
and Christakis (2008), but produced without any network interaction.^^

Figure 5d shows the difference between the "sender" and "receiver" coefficients.
If the two effects were truly equal, this difference would be approximately
Gaussian (a t-distribution with 400 degrees of freedom) and centered at zero, and
a t-test could be used to claim statistical significance for any difference between
the two effects. It is evident from the histogram that this null distribution is not
centered at zero, and about 77% of the sample values are positive, even though
there is really no effect. Thus, latent homophilous variables can produce a
substantial apparent contagion effect, including the asymmetry expected of actual
contagion.

The parameter values in this model were not chosen to maximize either 
the apparent contagion effect or its asymmetry, merely to demonstrate their 
presence. As well, we note that controlling for additional past values of the 
property for each node reduces the imbalance in magnitude, while it still remains 
statistically significant; as we show in Section 4.2, this is not the end of the story 
if we cannot find a bound for this asymmetry. 

Additionally, it may seem unlikely that these conditions may exist on unob- 
served variables in the system, but this still places the burden on the investigator 
to pursue as many possible latent factors as may be present — an extremely 
onerous task in a multi-decade observational study — or to work exclusively 
with experimental data, such as in the recent work of Fowler and Christakis 
(2010). 

3 How Contagion and Homophily Look Like Causation At The Individual Level

We would be remiss if we gave the impression that it is only investigators who 
actually take network structure into account who have problems. In this section, 
we show that a very common kind of use of survey data, namely that relating
individuals' choices (cultural, political, economic, etc.) to their long-term stable
traits, is also confounded in the presence of homophily and contagion. Continuing
in the spirit of Section 2.2, we present another toy model in which regressions
of choices on traits produce significant non-zero coefficients that are solely due
^^There is also the notion of a "bonus" effect for mutual ties, /34 A^j Y, (0), which 
could provide an additional bump for mutuality that would indicate a stronger tie than simply 
indicated by a binary specification. We leave this for another investigation, noting that the 
mutual > named > namer relation is satisfied without adding this term. 
Figure 6: Typical situation in surveys linking cultural choices to social traits
when homophily and influence exist.

to this confounding.^^ 

It should be emphasized that there is a long tradition within social science 
of distinguishing long-term, hard-to-change aspects of social organization and 
individuals' place in it, from more short-term, malleable aspects which show up 
in behavior and choices. As Ernest Gellner (1973) put it, "Social structure is who
you can marry, culture is what you wear at the wedding." The long-standing 
theoretical presumption, common to all the classical sociologists (even, in his 
own way, to Max Weber), and going back through them to Montesquieu if not 
beyond (Aron, 1989), is that social structure explains culture, or that the latter 
reflects the former; in many versions, culture is an adaptation to social structure. 
This intuition is alive and well through the social sciences, the humanities, and 
among lay people. Many of these accounts have considerable plausibility, though 
since they conflict with each other they cannot all be true. However, aside from 
casual empiricism, the evidence for them consists largely of correlations between 
cultural choices and social positions, demonstrations that the superstructure can 
be predicted from the base. Famously, for instance, Bourdieu (1984) attempts 
to do this for survey data. 

We do not wish to assert that social position is never a cause of cultural 
choices; like everyone else, we think that it often is. The issue, rather, is the 
evidence for such theories, and in particular for the magnitude of such effects. 

3.1 Simulation Model 

We work with what is, frankly, a toy model of contagion (though, see footnote 
13 below). There are n individuals connected in an undirected social network. 

^^Preliminary versions of these results appeared in Shalizi (2007), and as long ago as 2005 
at http://bactra.org/notebooks/neutral-cultural-networks.html. We understand from a 
presentation by Prof. Miller McPherson that he and colleagues have been working on parallel 
lines, and will soon publish a demonstration that biases of this sort can be quite substantial 
even for the canonical General Social Survey (M. McPherson, "Social Effects in Blau Space", 
presentation at MERSIH 2, 14 November 2009). 
Figure 7: Graphical model showing the causal structure of the model simulated 
in Section 3.1; cf. Figure 6. Notice that here, the persistent traits X have no 
direct causal influence on the choices Y . As we show, however, diffusion of 
choices along homophilous ties creates states where Y can be predicted from X. 



Each individual $i$ has an observed trait $X_i$ which is an unchanging variable; in 
our examples, this will be binary. The network is homophilous on this trait, 
so that individuals with the same value of $X$ are more likely to be connected. 
Individuals also have a time-varying choice variable $Y_i(t)$, which again we will 
take to be binary. The initial choices, $Y_i(0)$, are set by flipping a fair coin (i.e., 
an unbiased Bernoulli process), and are therefore independent of the traits $X_i$. 

Choices evolve as follows: at each time $t$ we pick an individual $I_t$ uniformly 
at random from $\{1, \ldots, n\}$, independently of all prior events. This individual 
then picks a neighbor, again uniformly at random, $J_t \in \{j : A_{I_t j} = 1\}$, and 
either, with very high probability, copies their choice, so that $Y_{I_t}(t) = Y_{J_t}(t-1)$, 
or, with very low probability, assumes the opposite choice, so that $Y_{I_t}(t) = 1 - Y_{J_t}(t-1)$; 
all other individuals retain their previous choices. This process repeats for 
each time step. Figure 7 shows the causal structure. 
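
The update rule above is simple enough to simulate directly. The following Python sketch is one possible implementation, not the code used for the figures. The function names, the two-block network generator, and the parameters `p_same`, `p_diff`, and `flip_prob` are all assumptions introduced here for illustration; the text itself specifies only "very high" and "very low" probabilities and does not say how the network was generated.

```python
import random


def make_homophilous_network(n, p_same, p_diff, rng):
    """Binary traits X plus an undirected graph in which same-trait pairs
    are linked with probability p_same and other pairs with p_diff.
    (Hypothetical generator: the tie-formation rule is an assumption.)"""
    x = [i % 2 for i in range(n)]  # alternate traits 0 and 1
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_same if x[i] == x[j] else p_diff
            if rng.random() < p:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return x, [sorted(s) for s in nbrs]


def simulate_copying(x, nbrs, steps, flip_prob, rng):
    """Noisy copying: one randomly chosen individual per step imitates a
    randomly chosen neighbor, or with probability flip_prob does the opposite."""
    n = len(x)
    # Initial choices are fair coin flips, drawn without reference to x.
    y = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        if not nbrs[i]:
            continue  # assumption: an isolated individual keeps its choice
        j = rng.choice(nbrs[i])
        if rng.random() < flip_prob:
            y[i] = 1 - y[j]
        else:
            y[i] = y[j]
    return y
```

For instance, `simulate_copying(x, nbrs, steps=3000, flip_prob=0.01, rng=random.Random(0))` returns the vector of choices after 3000 updates, which can then be compared against `x`.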

This random copying model is, of course, a drastic oversimplification of 
actual processes of transmission and influence, which have been extensively 
studied in social psychology and allied fields since the 1920s (Bartlett, 1932; 
Sperber, 1996; Huckfeldt et al., 2004; Friedkin, 1998). However, not only is 
it adequate to demonstrate the existence of the phenomenon we are concerned 
with, its very abstraction helps indicate just how robust the problem is. 

Probabilistically, the vector $Y(t)$ is a Markov chain, specifically a variant of 
the "voter model" of statistical mechanics on a graph (Liggett, 1985; Sood and 
Redner, 2005); the minor addition of low-frequency noise (doing the opposite 
of the selected neighbor) keeps the homogeneous configurations (where $Y_i$ is 
constant over $i$) from being absorbing states, but has little influence on the 



^13 Notice that the expected value of $Y_{I_t}(t+1)$ is just the mean of $Y_j(t)$ for the $j$ neighboring 
$I_t$. The expected value of $Y_i(t+1)$ for all $i$ is thus a weighted average of $Y_i(t)$ and the mean 
of their neighbors. At the level of expectations, then, this process belongs to the family of 
linear social influence models used in, e.g., Friedkin (1998). 
medium-run behavior we are concerned with. 

Figure 8 shows a typical evolution of this model. In the top image at the 
initial state of the system, there are two clusters based on social traits X, but 
the individual cultural choices (colors represent values of Y) are independent of 
these traits. The bottom image shows the same network and configuration after 
3000 updates. Now, even by eye, it is clear that one of the choices has become 
associated with one of the social types. 

This can be confirmed more quantitatively by doing a logistic regression of 
choice on trait (Figure 9) at several points during the diffusion process. In this 
particular example, there are significant deviations in each direction. First, the 
association between trait 1 and color 1 is positive and significant, and remains 
so for several dozen iterations; then the diffusion reverses the association, which 
then becomes negative and significant. For comparison, a network with the 
same average degree but no homophilous tie formation is shown to undergo the 
same diffusion process but with no corresponding association between choice 
and trait. 
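
Because both trait and choice are binary here, the logistic regression at each time point has a closed form: with a single binary predictor and an intercept, the fitted slope is the log odds ratio of the 2x2 table of trait by choice, and its usual Wald standard error is the square root of the sum of the reciprocal cell counts. A minimal Python sketch follows; the function name is ours, and it assumes all four cells of the table are non-empty.

```python
import math


def logit_slope_binary(x, y):
    """Slope and Wald standard error for a logistic regression of a
    binary choice y on a binary trait x, computed from the 2x2 table."""
    n = [[0, 0], [0, 0]]  # n[trait][choice]
    for xi, yi in zip(x, y):
        n[xi][yi] += 1
    if min(n[0][0], n[0][1], n[1][0], n[1][1]) == 0:
        raise ValueError("empty cell: the slope estimate is not finite")
    slope = math.log(n[1][1] * n[0][0] / (n[1][0] * n[0][1]))
    se = math.sqrt(sum(1.0 / c for row in n for c in row))
    return slope, se
```

An approximate 95% interval at a single time point is then `slope ± 1.96 * se`; as with the error bars in Figure 9, this treats each time point in isolation.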

Intuitively, the copying process tends to make neighbors more similar to each 
other; Ian's choice can be predicted from Joey's choice. On regular lattices, this 
mechanism causes the voter model to self-organize into spatially-homogeneous 
domains, with slowly shifting boundaries between them (Cox and Griffeath, 
1986). A similar process is at work here, only, owing to the assortative nature 
of the graph, neighbors tend to be of the same social type. Hence social type is 
an indirect cue to network neighborhood, and accordingly predicts choices. 

To summarize, this "neutral" process of diffusion, together with homophily, 
is sufficient to create what looks like a causal connection between an individual's 
social traits and cultural choice. This is because individuals' choices are not 
independent conditional on their traits, as is generally assumed in, e.g., survey 
research; diffusion creates the observed dependence. 

This demonstration shows that it is difficult to argue that, for example, being 
of type 0 is an indirect cause of picking the color black as opposed to red, since 
even within a single run of the model the association can be seen to reverse. 
Put another way, differences in social types are at most related to differences in 
choices, not to the actual content of those choices. 

4 Constructive Responses 

To sum up the argument so far, we have shown that latent homophily together 
with causal effects from the homophilous trait cannot be readily distinguished, 
observationally, from contagion or influence, and that this remains true even 
if there is asymmetry between "senders" and "receivers" in the network. We 

^*Note that the standard errors are from the isolated logistic regression at each time point; 
when taken collectively, the errors in the effect size would be different. Our point remains 
that this would be the effect size estimated if the time evolution were not properly accounted 
for. 

^^It should be clarified here that the problem is not the ecological fallacy, or a red-state/blue-state 
issue (Gelman et al., 2008), since the simulation is not aggregating any data. 
Figure 8: An illustration of the diffusion process on a network with homophilous 
ties; members of the left and right clusters have attribute values of 0 and 1 
respectively. Initially (top), there is very little detectable similarity between 
choices within each cluster; however, after a few hundred time steps (bottom), 
there is a clear association between trait and cluster caused entirely by the 
diffusion along homophilous ties. 
Figure 9: Coefficient estimates for logistic regressions of choice on trait as functions 
of time. Error bars represent 95% confidence intervals on each run, independent 
of all others. Left: the evolution in a homophilous network; in this 
run of the simulation, the coefficient first becomes negative and statistically 
significant, then becomes positive and significant, purely due to diffusion along 
homophilous ties, before returning to a state of negative significance. Right: a 
corresponding series of estimates in a network where ties form independently of 
traits; deviations from neutrality are much smaller. 

have also shown that the combination of homophily and contagion can imitate 
a causal effect of the homophilous trait. It requires little extra to see that con- 
tagion, plus a causal influence of the contagious trait, yields a network that 
contains the appearance of homophily. Thus, given any two of homophily, con- 
tagion, and individual-level causation, the third member of the triad seems to 
follow. 

We realize that these results appear to wreck the hopes on which many ob- 
servational studies of social networks have rested. It would be nice to think that 
something could nonetheless be salvaged from the ruins. The "easy" solution is 
to use expert knowledge of the system to identify all causally relevant variables, 
measure a sufficient set of them, and adjust for them appropriately (Morgan and 
Winship, 2007; Pearl, 2000; Spirtes et al., 2001). Since this is clearly a utopian 
proposal, we sketch three constructive responses which may be possible when 
dealing with network data where the causal structure is imperfectly understood 
or incompletely measured. These are to randomize over the network, to place 
bounds on unidentifiable effects, and to use the division of the network into 
communities as a proxy for latent homophily. 

4.1 Identifying Contagion from Non-Neighbors 

The essential obstacle to identifying contagion in the setting of Figure 1 is that 
the presence or absence of a social tie $A_{ij}$ between individuals $i$ and $j$ provides 
information on the latent variable $X_i$, whether we implicitly include the tie by 
predicting $Y_i(t)$ from the past values of neighbors $Y_j(t-1)$ or we explicitly add 
$A_{ij}$ to the prediction model. In the language of graphical models, conditioning 
or selecting on $A_{ij}$ "activates the collider" at that variable. This suggests that 
we would do better, in some circumstances, to construct a useful inference by 
deliberately not conditioning on the social network, thereby keeping the collider 
quiescent. We outline this method to demonstrate the possibility, rather than 
to advocate a new prescription for solving the problem. 

We can conduct the following procedure over many repeated trials: 

1. Divide the nodes into two groups, by assigning each node to one of two 
bins with equal probability; let these groups be labelled as $J_1$ and $J_2$. 

2. Let $Y_{J_1}(t)$ be the vector-valued time series obtained by collecting each of 
the $Y_i(t)$ for $i \in J_1$ into one object, and similarly for $Y_{J_2}(t)$. 

3. Use some available mechanism to predict the time series for the first bin, 
$Y_{J_1}(t)$, from its lagged counterpart, $Y_{J_2}(t-1)$, while controlling for the 
previous time point within the first half, $Y_{J_1}(t-1)$. 

By repeating this procedure, then averaging over all iterations (producing 
new partitions each time), there will be a non-zero predictive ability if and only 
if there is actual contagion or influence. We can see why one must average 
over multiple divisions as follows. Clearly, influence is possible between the two 
halves only if there are social ties linking them. However, there will generally 
exist some way of picking $J_1$ and $J_2$ so that there are no linking ties, and in 
the presence of homophily, those will tend to be divisions of the network into 
parts which are unusually dissimilar in their homophilous traits. If we restricted 
ourselves to values of $J_1$ and $J_2$ which did have linking ties, we would once again 
be selecting on the homophilous trait and activating colliders. 

This may not be a practical method, as the statistical power of this test may 
be very low — the data have very high dimension, and the method deliberately 
selects random predictors — but it will be non-zero. 
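
The random-halves procedure can be sketched in Python as follows. This is an illustration under strong simplifying assumptions, not a recommended estimator: the text leaves the prediction step open ("some available mechanism"), so here each half is compressed to its mean at every time point, and predictive ability is measured by the partial correlation between the mean of $Y_{J_1}(t)$ and the mean of $Y_{J_2}(t-1)$ given the mean of $Y_{J_1}(t-1)$. All function names and the layout `history[t][i]` for $Y_i(t)$ are hypothetical.

```python
import math
import random


def random_halves(n, rng):
    """Assign each of n nodes to one of two bins with equal probability."""
    j1 = [i for i in range(n) if rng.random() < 0.5]
    chosen = set(j1)
    j2 = [i for i in range(n) if i not in chosen]
    return j1, j2


def corr(a, b):
    """Pearson correlation; returns 0.0 if either series is (nearly) constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa <= 1e-12 or sbb <= 1e-12:
        return 0.0
    return sab / math.sqrt(saa * sbb)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    denom = math.sqrt(max(0.0, 1 - rxz ** 2) * max(0.0, 1 - ryz ** 2))
    return 0.0 if denom == 0 else (rxy - rxz * ryz) / denom


def random_halves_score(history, trials, rng):
    """Average the cross-half partial correlation over fresh random splits."""
    n = len(history[0])
    total, used = 0.0, 0
    for _ in range(trials):
        j1, j2 = random_halves(n, rng)
        if not j1 or not j2:
            continue  # a degenerate split carries no information
        m1 = [sum(y[i] for i in j1) / len(j1) for y in history]
        m2 = [sum(y[i] for i in j2) / len(j2) for y in history]
        total += partial_corr(m1[1:], m2[:-1], m1[:-1])
        used += 1
    return total / used if used else 0.0
```

Note that the splits are drawn without looking at the adjacency matrix at all, which is the point of the method: nothing here conditions on the network.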

Even the random-halves test will fail, however, if we add a direct causal effect 
of $X_j$ on $Y_i(t)$ (or one modulated by $A_{ij}$). We omitted such a link in Figure 
1 and subsequently, on the assumption that causal effects between individuals 
must pass through observed behavior Y, but this is a non-trivial substantive 
hypothesis requiring rigorous justification. 

4.2 Bounds 

In Sections 2 and 3, we saw that certain causal effects were not identifiable; 
that different causal processes could produce identical patterns of observed as- 
sociations. As Manski (2007) emphasizes, even when parameters (such as the 
causal effect of $Y_j(t-1)$ on $Y_i(t)$) are observationally unidentifiable, the distribution 
of observations may suffice to bound the parameters. (With sampled 
data, the empirical distribution of observations generally provides estimators of 
those bounds.) Sometimes these bounds can be quite useful, even in the general 
non-parametric case. 

We thus propose as a topic for future research placing bounds on the causal 
effect of $Y_j(t-1)$ on $Y_i(t)$ in terms of observable associations, assuming the 

^^Thanks to Peter Spirtes and Richard Scheines for making this paradoxical suggestion. 
structure of Figure 1. If the bound on this effect excluded zero, that would 
show the observed association could not be due solely to homophily, but that 
some contagion must also be present. 

If we keep the causal structure of Figure 5 and further assume that the $Y$ and $X$ 
variables are all jointly Gaussian and that all relations between continuous variables 
are linear,^^ we can employ the usual rules for linear path diagrams (Spirtes 
et al., 2001). The standardized linear-model coefficient for regressing senders 
on receivers, i.e., $Y_i(t)$ on $Y_j(t-1)$, controlling for all other observables, turns 
out to be 

$$\rho[X_j, Y_j(t-1)] \, \rho[X_i, X_j \mid A_{ij} = 1] \, \rho[X_i, Y_i(t)]$$

where $\rho[K, L]$ is the path coefficient between $K$ and $L$ (and $\rho[K, L \mid M]$ is the 
path coefficient given the required condition $M$, rather than an observable that 
would be controlled for). Clearly, any standardized regression coefficient can be 
obtained here by adjusting path coefficients for unobserved variables X. Thus 
a bound on the true causal effect cannot be based on the linear regression 
coefficient alone, but we hope it may still be possible to find a bound which 
uses more information about the pattern of associations. 
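
As a purely numerical illustration of why the regression coefficient alone cannot bound the causal effect, take hypothetical values (not estimates from any data) in which each of the three path coefficients through the latent traits equals 0.5. The product above then evaluates to

```latex
\[
\rho[X_j, Y_j(t-1)] \; \rho[X_i, X_j \mid A_{ij} = 1] \; \rho[X_i, Y_i(t)]
  = 0.5 \times 0.5 \times 0.5 = 0.125 .
\]
```

More generally, any target value $r$ with $0 < r < 1$ is reproduced by setting all three coefficients to $r^{1/3}$, and flipping the sign of one of them gives $-r$. Since none of the three is observable, the fitted coefficient by itself does not pin any of them down.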

It would also be valuable — and perhaps more tractable — to place limits on 
the magnitude of the association which could be generated solely by homophily. 
Parallel remarks apply to bounding the causal effect of Xi on Yi{t) assuming 
the structure of Figure 6; we suspect, though merely on intuition, that this will 
be harder than bounding contagion effects. 

Along these lines, it would be particularly interesting to bound the degree of 
asymmetry in regressions which can be generated in the absence of direct causal 
influence (as in Section 2.2). Even though asymmetry as such can be produced 
in the absence of influence or contagion, it could be that by some standard, really 
big asymmetries can only plausibly be explained by influence, so that detecting 
such asymmetries would be evidence for influence. More exactly, if one can 
establish that in the absence of direct influence the degree of asymmetry can be 
at most $\alpha_0$, and one finds an actual asymmetry of $\hat{\alpha} > \alpha_0$, then the hypothesis 
of influence has passed a more or less severe test (Mayo, 1996), the severity 
depending on the ease with which sampling fluctuations and the like can push 
the estimated asymmetry $\hat{\alpha}$ over the threshold when the "true" asymmetry (in 
the population or ensemble) was below it. 

4.3 Network Clustering 

Since the problems we have identified stem from latent heterogeneity of a 
causally important trait, the solution would seem to be to identify, and then 
control for, the latent trait. "Homophily" means simply that individuals tend 
to choose neighbors that resemble them; this tendency will be especially pro- 
nounced if pairs of neighbors also have other neighbors in common, since these 
pairings will also be driven by homophily. This suggests that homophily, latent 

^^Note that our simulation had a non-linear relationship between $X_i$ and $Y_i$. 
or manifest, will tend to produce a network built primarily of homogeneous clusters, 
also called, in this context, "communities" or "modules". Inversely, such 
clusters will tend to consist of nodes with the same value of the homophilous 
trait. 

The topic of community discovery — essentially, dividing graphs into ho- 
mogeneous, densely inter-connected clusters of nodes, with minimal connection 
between clusters — has been thoroughly explored in the recent literature (ex- 
plicitly in Girvan and Newman (2002); Newman and Girvan (2003); Bickel and 
Chen (2009); Porter et al. (2009); Fortunato (2010), implicitly in much smaller 
clusters in Elwert and Christakis (2008)). A natural idea would be to first estab- 
lish the existence of these clusters, to note the memberships of each individual 
in the chosen model, call this estimate $C_i$, and to control for $C_i$ when looking 
for evidence of contagion or influence. 

By the arguments we have presented so far, such control-by-clustering will 
generally be unable to eliminate the confounding.^18 However, in conjunction 
with the bounds approach mentioned above, conditioning on estimated commu- 
nity memberships might still noticeably reduce the confounding. On the other 
hand, misspecification of the block structure may make the problem worse — 
consider the cases where the generating mechanism may be a mixed-membership 
block model (Airoldi et al., 2008) or "role" model (Reichardt and White, 2007) 
but communities are "discovered" assuming a simple modular network struc- 
ture. Estimating the damage due to misspecification in this case is a goal of 
future research. 

5 Conclusion: Towards Responsible Just-So Story- 
Telling 

We have seen that when there is latent homophily, contagion effects are uniden- 
tifiable, and even the presence of contagion cannot be distinguished observation- 
ally from a causal effect of the homophilous trait. Conversely, when contagion 
and homophily both exist, choices can be predicted from the homophilous trait, 
and so the effects of such traits on socially influenced variables are again observationally 
unidentifiable. These results raise barriers to many inferences social 
scientists would like to make. The barriers can be breached by assuming enough 
about the causal architecture of the process in question, though then the infer- 
ences stand or fall with those architectural assumptions; perhaps the bounding 
approach can squeeze an opening through them as well. Beyond these technical 
qualifications, what is the larger moral for social science? 

Accounts of social contagion are fundamentally causal accounts, pointing to 

^18 The exception will be if $C_i$ was a predictively sufficient statistic, which in this case would 
mean that the realized graph $A$ provided enough information to render the true community 
memberships of all nodes conditionally independent of their observed behaviors. Then we 
would effectively move from the situation of Figure 1 to that of Figure 3b, with $C_i$ in the role 
of $Z_i$. Determining the class of network models for which such "screening off" holds is the 
subject of on-going work. 
one of a number of mechanisms (imitation, persuasion, etc.) by which a 
belief or behavior spreads through a population. Similarity among individuals 
is explained by their belonging to common networks; differences by differences 
in their networks. This parallels the other great project of social science, which 
is to explain differences in cultural choices by location within the social struc- 
ture, or, at a broader scale, by differences between social structures (Boudon, 
1986/1989; Berger, 1995; Lieberson, 2000). The accounts that have connected 
social structure to behavior have typically been adaptationist or functionalist: 
the content or meaning of cultural choices serves the choosers' interests, or their 
classes' interests, or (far more nebulously) the interests of the system, or re- 
flects their experiences in life, or rationalizes their positions in life, and so forth. 
At the very least, these are causal accounts: if social structure or social po- 
sitions were different, the content of the choices would be different. Far more 
commonly, they really are adaptationist accounts: choices fit to the objective 
circumstances. They accordingly follow the familiar pattern of the "Just-So" 
story (Kipling, 1974), with all their familiar problems. It would be intellectually 
irresponsible to accept such accounts, with their strong causal claims, without 
careful checking; but also irresponsible to simply dismiss them out of hand. 

The example of biology suggests that a powerful way of doing such tests is to 
use "neutral models" (Harvey and Pagel, 1991; Gillespie, 1998), which biologists 
use to test claims that features of organisms are evolutionary adaptations; we 
note the similarity with the "null hypothesis" in general statistical hypothesis 
testing. A neutral evolutionary model should include all the relevant features of 
the evolutionary process except adaptive forces (such as natural or sexual selec- 
tion). The expected behavior of the system is then calculated under the neutral 
model (namely, the distribution of expected outcomes); if the data depart significantly 
from the predictions of the neutral model, this is taken as evidence 
of adaptation. Said another way, the neutral model as a whole is used as the 
null hypothesis, not just a generic regression model with some coefficients set to 
zero. For instance, a model might include mutation and genetic recombination, 
but assume all organisms are equally likely to be parents of the next generation; 
all have equal fitness. Gene frequencies will change in such a model because of 
random fluctuations; some organisms become parents and have differing num- 
bers of offspring. Indeed, we expect some genetic variants to go to fixation (to 
become universal) in the population, and others to disappear entirely through 
the effects of repeated sampling. We are not aware of any studies in the so- 
ciology of culture or related fields employing formal neutral models; however, 
something similar to this is implicit in the arguments of Lieberson (2000),^20 
and some other strands of recent work on "endogenous explanations of culture" 
(Kaufman, 2004). 
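
The logic of testing against a neutral model can be sketched in a few lines of Python. This is a generic Monte Carlo comparison under assumptions of our own, not a procedure taken from the works cited: `neutral_drift_final_freq` is a hypothetical stand-in for a neutral model (pure resampling with equal fitness, omitting mutation and recombination), and `monte_carlo_p_value` compares an observed statistic with its simulated distribution under that model.

```python
import random


def neutral_drift_final_freq(pop_size, generations, start_freq, rng):
    """Frequency of a variant after repeated neutral resampling: each member
    of the next generation picks a parent uniformly at random."""
    count = round(start_freq * pop_size)
    for _ in range(generations):
        p = count / pop_size
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
    return count / pop_size


def monte_carlo_p_value(observed, simulate_stat, n_sims, rng):
    """One-sided Monte Carlo p-value: the share of neutral simulations whose
    statistic is at least as large as the observed one (with the usual +1
    correction so the estimate is never exactly zero)."""
    exceed = sum(1 for _ in range(n_sims) if simulate_stat(rng) >= observed)
    return (exceed + 1) / (n_sims + 1)
```

A usage sketch: with `stat = lambda r: abs(neutral_drift_final_freq(100, 50, 0.5, r) - 0.5)`, the call `monte_carlo_p_value(observed_shift, stat, 999, random.Random(0))` asks how often drift alone moves a variant's frequency at least as far as the hypothetical `observed_shift`. The whole simulated process, not a regression coefficient set to zero, plays the role of the null hypothesis.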

The point is not that accounts of causation and adaptation in social phe- 

Superficially, this looks very much like the effects of selection, even though the statistical 
properties of fixation via sampling and fixation via selection are quite different; in particular, 
fixation via selection is much faster. 

^20 Lieberson and Lynn (2002), while offering evolutionary biology as a methodological model 
for social science, curiously does not mention the issue of neutral models. 
nomena must be rejected; it is that they must be subjected to critical scrutiny, 
and that comparison to neutral models is a particularly useful form of critique. 
Our toy models produce the kind of phenomena which theories of contagion, or 
of adaptation and reflection, set out to explain. (It is only too easy to imag- 
ine crafting a historical narrative for Figure 8, explaining the deep forces that 
impelled the east to become red.) The best way forward for advocates of those 
theories may in fact be to craft better, more compelling neutral models than 
ours, and show that even these cannot account for the data. Thus they will 
support their theories not only by plausible just-so stories, but by compelling 
evidence. 